---
title: "Name Frequencies in the Gospels & Acts - Analysis, revised"
author: "Jason Wilson"
date: "`r Sys.Date()`"
output:
  html_document:
    toc: yes
    toc_depth: 3
    toc-location: right
    number_sections: yes
    embed-resources: true
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
knitr::opts_chunk$set(warning = FALSE)

load("C:/RProjects/VanDeWeghe_Gospel_Historicity/Gospels-Acts-Ilan5.RData")
index.keep = ifelse(Ilan.Vol3$Exception == '-', TRUE, FALSE)
Ilan.Vol3r = Ilan.Vol3[index.keep, ] #Ilan3-reduced, for the Revision 17 note, below
```

```{r, include=FALSE}
library(dplyr)
library(ggplot2)
library(gridExtra)
library(tidyr)
library(pwr) #for power analysis
library(devEMF) #for high-res plots
library(readxl)

#Global resampling variables
NREP = 10^1 #This is the number of replicates for all Uniform bootstrap tests.
            #It should only be set to 10^4
            #for the final run.  It dramatically increases the run-time.

GA.size = 52 #This is the sample size of our uniform simulation.  I set it as a global variable because otherwise I'd need to recalculate it inside of simulations for the CI calculations and I wanted to be able to change it, if necessary.

set.seed(2024) #Fix the seed, for reproducibility
```

# Introduction

## Version 17
This document is revised version 17 of the analyses, dating to March 2026, for: 

Van de Weghe, Luuk and Jason Wilson.  2024.  Why Name Popularity is a Good Test of Historicity.  *Journal for the Study of the Historical Jesus*.  June 26, 2024.  DOI: 10.1163/17455197-bja10035
https://brill.com/view/journals/jshj/aop/article-10.1163-17455197-bja10035/article-10.1163-17455197-bja10035.xml

The issue for this revision regards the original scrape of Ilan-3 from the Hayim Lapin GitHub repository.  In the process of later examinination of the data, Luuk discovered some of the Hebrew names were missing.  We verified that the source of the missingness is that the files were absent from the website as of 3/6/26.  We acquired the electronic version of the manuscript and manually went through it.  Seven biblical names were missing, with a total of 112 occurrences. They are added back in the data for this update.  Re-running the scripts, in no case do the original conclusions change, i.e. the p-values remain above or below the benchmark p-value of 0.05/20 = 0.003.  The comparison of updated p-values for our original Table 3 is below:

Name Frequency Results:

1. The bin-checking output changed slightly, but the same bins were optimal
2. GA vs. Ilan3: Old p-value = 4.29e-13 | Rev p-value = 2.70e-08
3. Josephus vs. Ilan3: Old p-value = 2.2e-16 | Rev p-value = 9.01e-15
4. Historical Novels vs. Ilan3: Old p-value = 2.2e-16 | Rev p-value = 1.32e-14
5. Ilan-1F Fictitious vs. Ilan3: Old p-value = 2.2e-16 | Rev p-value = 2.2e-16
6. Uniform vs. Ilan3: Old p-value = 4.81e-5 | Rev p-value = 8.83e-06 (re-ran with NREP=10000)

Name Origin Results:

7. GA vs. Ilan3: Old p-value = 2.2e-16 | Rev p-value = 2.2e-16
8. Josephus vs. Ilan3: Old p-value = 2.2e-16 | Rev p-value = 2.2e-16
9. Ilan-1F Fictitious vs. Ilan3: Old p-value = 2.2e-16 | Rev p-value = 2.2e-16
10. Uniform vs. Ilan3: Old p-value = 0.2819 | Rev p-value = 0.1259 (re-ran with NREP=10000)

In the process of discovering the issue with the Ilan3 data, and adding the names and re-running the hypothesis tests, we decided to check one additional thing.  The *Exception* variable for Ilan3 contains the following entries, "Hebrew?", "Jew?", "Samaritan?", "Christian convert", and "Christian?".  The figures were presumed to be ethnic Jews in our analysis, and were included by Ilan, but there was some question, as noted.  Since there are so many (800 out of the 1322 entries), we decided to run the name frequency tests with the exceptions both included and excluded.  P-values shown above are for the included cases (from running the same code as the original, but with the updated data).  We have added lines of code to run the same tests with the exceptions excluded (522 entries).  Again, in no case do the p-values change from above or below the benchmark of 0.05/18 = 0.003.

All code below is unchanged from version 16, with the single exception that bin-checking and run lines are *added* for the Ilan3 exclusion case mentioned above.  The text below is the original revision note from version 16:

## Version 16
After publication of the paper, we became aware of two issues that, taken together, we felt warranted a post-publication revision.  The first issue was that the name "Peter" was in our original Gospels and Acts data, but the corresponding name in Ilan is "Petrus".  The second issue is that in the Historical Novels, some of the references are to real historical figures that occur in Ilan's database.  Thus, the Historical Novels warrant a partial removal of names, but not all names, from Ilan in the calculations.  Statistically, any time a single name changes, it changes the corresponding tests' p-value(s).  The principal p-value changes are GA vs. Ilan-1 0.8556 to 0.7939 and Historical Novels vs. Ilan-1 6.517e-15 to 1.092e-08.  Because the above adjustments are small, in no case do the updated p-values cross the threshold used in the paper and change the conclusions.  We consider a public update of these issues important for integrity and transparency to the scholarly community, but they should not be of interest to anyone else.

Since we decided to publish a revision, we made a few additional minor improvements that would have otherwise remained unpublished.  Here is the complete list of revisions:

(1) Added a list of definitions for the many data.frames
(2) Improved the Excel spreadsheet version of the data by adding a few remarks and highlights
(3) Revised the following datasets: 
- Ben-Hur and The Spear (slight corrections) 
- Matthew - one Thaddaius was added (a Greek name)
- Mark - Peter changed to Petrus; no name origin change
- LukeActs - one Judas was added (a Biblical name); the six deacons (Greek names) were included: Stephanus, Philip, Procorus, Nicanor, Timon, Parmenas
- John - NONE
- GA3 and GA already had Thaddaius and the deacons, but Peter was changed to Petrus. 
- B79 Peter changed to Petrus.
(2) Revise chi.square() to include a partial remove option, run on Ben-Hur & The Spear
(3) Update some of the customized name manipulations related to GA3, see Section 4
(4) Seed added for the random sample() function, in order to standardize output for successive runs.  In addition, in the prior version the term "permutation test" was used.  It has been replaced with "bootstrap hypothesis test", which is more proper.


# Functions for user-settable grouping
I wrote one function to perform both a chi-square goodness-of-fit test, as well as a chi-square test of independence.  Additionally, I added an argument to the function, specifically for the Ilan1 case, to remove the GA (and Josephus) names from Ilan1.  There is no need to do that for Ilan-1F and Ilan3, since the GA (and Josephus) are not included in them.  (Warning: It works for Ilan1, but may break for other datasets, and was not written for general-purpose use.)  I later added a function to check for optimal bins, and an additional argument for partial removal was added with this version.

Algorithm

1. Find a good set of breaks for the reference (Ilan1, Ilan3)
2. Tabulate reference
3. Tabulate target (GA, Josephus, Ilan1-F, Historical Novels, Uniform Distribution)
4. Run chi-square goodness-of-fit test
5. Run chi-square test of independence

## Goodness of Fit Function
```{r}
chi.square <- function(Ilan, Target, breaks = c(1, 4, 20, 100, 150), remove = c(FALSE, TRUE, 'partial'), MAD.RMS = FALSE)
{  #CHI.SQUARE is the master function which takes the data in case format and runs both a chi-square goodness-of-fit test of Target with the Reference, as well as removes the Target from the Reference and performs a chi-square test of independence between the two.

#Label the frequency groups
Ilan.name.freq = data.frame(table(Ilan$name), Category=NA)
colnames(Ilan.name.freq) <- c('Name', 'Freq', 'Category')
Ilan.name.freq$Category = ifelse(Ilan.name.freq$Freq<=breaks[1], 'F1', Ilan.name.freq$Category)
Ilan.name.freq$Category = ifelse(Ilan.name.freq$Freq>breaks[1] & Ilan.name.freq$Freq<=breaks[2], 'F2', Ilan.name.freq$Category)
Ilan.name.freq$Category = ifelse(Ilan.name.freq$Freq>breaks[2] & Ilan.name.freq$Freq<=breaks[3], 'F3', Ilan.name.freq$Category)
Ilan.name.freq$Category = ifelse(Ilan.name.freq$Freq>breaks[3] & Ilan.name.freq$Freq<=breaks[4], 'F4', Ilan.name.freq$Category)
Ilan.name.freq$Category = ifelse(Ilan.name.freq$Freq>breaks[4] & Ilan.name.freq$Freq<=breaks[5], 'F5', Ilan.name.freq$Category)
Ilan.name.freq$Category = ifelse(Ilan.name.freq$Freq>breaks[5], 'F6', Ilan.name.freq$Category)

#Obtain Reference frequencies
f.Ilan = Ilan.name.freq %>% 
  group_by(Category) %>%
  summarize(Frequency = sum(Freq))

#Pull the name lists
names.F1 = filter(Ilan.name.freq, Category=='F1')$Name
names.F2 = filter(Ilan.name.freq, Category=='F2')$Name
names.F3 = filter(Ilan.name.freq, Category=='F3')$Name
names.F4 = filter(Ilan.name.freq, Category=='F4')$Name
names.F5 = filter(Ilan.name.freq, Category=='F5')$Name
names.F6 = filter(Ilan.name.freq, Category=='F6')$Name

#Cross-check against target
Tar.name.freq = data.frame(table(Target$name), Category=NA)
colnames(Tar.name.freq) <- c('Name', 'Freq', 'Category')

Tar.name.freq$Category = ifelse(is.na(match(Tar.name.freq$Name, names.F1)), Tar.name.freq$Category, 'F1')
Tar.name.freq$Category = ifelse(is.na(match(Tar.name.freq$Name, names.F2)), Tar.name.freq$Category, 'F2')
Tar.name.freq$Category = ifelse(is.na(match(Tar.name.freq$Name, names.F3)), Tar.name.freq$Category, 'F3')
Tar.name.freq$Category = ifelse(is.na(match(Tar.name.freq$Name, names.F4)), Tar.name.freq$Category, 'F4')
Tar.name.freq$Category = ifelse(is.na(match(Tar.name.freq$Name, names.F5)), Tar.name.freq$Category, 'F5')
Tar.name.freq$Category = ifelse(is.na(match(Tar.name.freq$Name, names.F6)), Tar.name.freq$Category, 'F6')
Tar.name.freq$Category = ifelse(is.na(Tar.name.freq$Category), 'F1', Tar.name.freq$Category) #Any missing names go 'rare'=F1

#Obtain Target frequencies
f.Tar = Tar.name.freq %>% 
  group_by(Category) %>%
  summarize(Frequency = sum(Freq))

#Ad hoc addition needed for the bootstrap test when small samples occur
if(nrow(f.Tar)<6)
{
  f.Tar2 = data.frame(Category = c('F1', 'F2', 'F3', 'F4', 'F5', 'F6'),
                      Frequency = rep(0, times=6))
  data = full_join(f.Tar, f.Tar2, by='Category', relationship='many-to-one')
  data$Category <- factor(data$Category, levels = f.Tar2$Category, ordered=TRUE)
  ind = order(data$Category)
  f.Tar2$Frequency = ifelse(is.na(data[ind,]$Frequency.x), 0, data[ind,]$Frequency.x)
  f.Tar = f.Tar2
}

###Goodness-of-fit test (2 cases: no remove & remove)
# For printing
B = breaks
breaks2 = c(paste('<=',B[1]), 
            paste(B[1]+1,'-',B[2]), 
            paste(B[2]+1,'-',B[3]), 
            paste(B[3]+1,'-',B[4]), 
            paste(B[4]+1,'-',B[5]), 
            paste('>',B[5]))

# Case 1: No remove
if(remove == FALSE)
{
  out1 = chisq.test(f.Tar$Frequency, p = f.Ilan$Frequency/sum(f.Ilan$Frequency), correct = FALSE)
  #print(out1)
  
  out2 = data.frame(Frequency = breaks2, Reference = f.Ilan$Frequency, Target = f.Tar$Frequency, Expected = out1$expected, ChiSq = (out1$observed - out1$expected)^2 / out1$expected)
  #print(out2)
  
  if(MAD.RMS == TRUE)
  {
    out2b = c(MAD = sum(abs(f.Ilan$Frequency[1:6] - rep(sum(f.Ilan$Frequency[1:6])/6, times=6)))/6 ,
              RMS = sqrt(sum((f.Ilan$Frequency[1:6] - rep(sum(f.Ilan$Frequency[1:6])/6, times=6))^2)) )
    print(paste('The MAD is', out2b[1], 'The RMS is', out2b[2]) )
  }#end if
  
}#end if

# Case 2: Remove
if(remove == TRUE)
{
  #Removal
  Tar.name.freq$Freq = -Tar.name.freq$Freq #Set to negative to use "sum" function
  Combined.name.freq = full_join(Ilan.name.freq, Tar.name.freq, by='Name')
  Freq.reduced = apply(Combined.name.freq[,c(2,4)], MARGIN = 1, sum, na.rm=TRUE)
  Combined.name.freq2 = data.frame(Combined.name.freq, Freq.reduced)
  
  # Re-run frequency categories for Ilan
  f.Ilan2 = Combined.name.freq2 %>% 
    group_by(Category.x) %>%
    summarize(Frequency = sum(Freq.reduced))
  
  #Goodness-of-fit
  out1 = chisq.test(f.Tar$Frequency, p = f.Ilan2$Frequency[1:6]/sum(f.Ilan2$Frequency[1:6]), correct = FALSE)
  #print(out1)
  
  out2 = data.frame(Frequency = breaks2, Reference = f.Ilan2$Frequency[1:6], Target = f.Tar$Frequency, Expected = out1$expected, ChiSq = (out1$observed - out1$expected)^2 / out1$expected)
  #print(out2)
  
  if(MAD.RMS == TRUE)
  {
    out2b = c(MAD = sum(abs(f.Ilan$Frequency[1:6] - rep(sum(f.Ilan$Frequency[1:6])/6, times=6)))/6 ,
              RMS = sqrt(sum((f.Ilan$Frequency[1:6] - rep(sum(f.Ilan$Frequency[1:6])/6, times=6))^2)) )
    print(paste('The MAD is', out2b[1], 'The RMS is', out2b[2]) )
  }#end if
  
 }#end if

# Case 3: Partial Remove
# This was added to deal with partial occurrence of historical characters in Ben Hur and other such documents.

if(remove == 'partial')
{
  #Cross-check against target
  Tar.name.freq.p = data.frame(table(Target$In.Ilan1), Category=NA)
  colnames(Tar.name.freq.p) <- c('Name', 'Freq', 'Category')
  
  Tar.name.freq.p$Category = ifelse(is.na(match(Tar.name.freq.p$Name, names.F1)), Tar.name.freq.p$Category, 'F1')
  Tar.name.freq.p$Category = ifelse(is.na(match(Tar.name.freq.p$Name, names.F2)), Tar.name.freq.p$Category, 'F2')
  Tar.name.freq.p$Category = ifelse(is.na(match(Tar.name.freq.p$Name, names.F3)), Tar.name.freq.p$Category, 'F3')
  Tar.name.freq.p$Category = ifelse(is.na(match(Tar.name.freq.p$Name, names.F4)), Tar.name.freq.p$Category, 'F4')
  Tar.name.freq.p$Category = ifelse(is.na(match(Tar.name.freq.p$Name, names.F5)), Tar.name.freq.p$Category, 'F5')
  Tar.name.freq.p$Category = ifelse(is.na(match(Tar.name.freq.p$Name, names.F6)), Tar.name.freq.p$Category, 'F6')
  Tar.name.freq.p$Category = ifelse(is.na(Tar.name.freq.p$Category), 'F1', Tar.name.freq.p$Category) #Any missing names go 'rare'=F1
  
  #Obtain Target frequencies
  f.Tar.p = Tar.name.freq.p %>% 
    group_by(Category) %>%
    summarize(Frequency = sum(Freq))
  
  #Removal
  Tar.name.freq.p$Freq = -Tar.name.freq.p$Freq #Set to negative to use "sum" function
  Combined.name.freq = full_join(Ilan.name.freq, Tar.name.freq.p, by='Name')
  Freq.reduced = apply(Combined.name.freq[,c(2,4)], MARGIN = 1, sum, na.rm=TRUE)
  Combined.name.freq2 = data.frame(Combined.name.freq, Freq.reduced)
  
  # Re-run frequency categories for Ilan
  f.Ilan2 = Combined.name.freq2 %>% 
    group_by(Category.x) %>%
    summarize(Frequency = sum(Freq.reduced))
  
  #Goodness-of-fit
  out1 = chisq.test(f.Tar$Frequency, p = f.Ilan2$Frequency[1:6]/sum(f.Ilan2$Frequency[1:6]), correct = FALSE)
  #print(out1)
  
  out2 = data.frame(Frequency = breaks2, Reference = f.Ilan2$Frequency[1:6], Target = f.Tar$Frequency, Expected = out1$expected, ChiSq = (out1$observed - out1$expected)^2 / out1$expected)
  #print(out2)
  
  if(MAD.RMS == TRUE)
  {
    out2b = c(MAD = sum(abs(f.Ilan$Frequency[1:6] - rep(sum(f.Ilan$Frequency[1:6])/6, times=6)))/6 ,
              RMS = sqrt(sum((f.Ilan$Frequency[1:6] - rep(sum(f.Ilan$Frequency[1:6])/6, times=6))^2)) )
    print(paste('The MAD is', out2b[1], 'The RMS is', out2b[2]) )
  }#end if
  
 }#end if
  

###Test of independence
# Case 1: No removal (direct comparison)
if(remove == FALSE)
{
  out3 = chisq.test(data.frame(f.Tar$Frequency, f.Ilan$Frequency), correct = FALSE)
  out4 = data.frame(Frequency = breaks2, Target = f.Tar$Frequency, Reference = f.Ilan$Frequency, Expected = out3$expected, ChiSq = (out3$observed - out3$expected)^2 / out3$expected)

  #print(out3)
  return(list(out1, out2, out3, out4))
}#end if

# Case 2: Removal (Remove Target from Reference)
if(remove == TRUE)
{
  out3 = chisq.test(data.frame(f.Tar$Frequency, f.Ilan2$Frequency[1:6]), correct = FALSE)
  out4 = data.frame(Frequency = breaks2, Target = f.Tar$Frequency, Reference = f.Ilan2$Frequency[1:6], Expected = out3$expected, ChiSq = (out3$observed - out3$expected)^2 / out3$expected)
  #print(out3)
  return(list(out1, out2, out3, out4))
}#end if

# Case 3: Partial Removal (Remove Partial Target from Reference)
if(remove == 'partial')
{
  out3 = chisq.test(data.frame(f.Tar$Frequency, f.Ilan2$Frequency[1:6]), correct = FALSE)
  out4 = data.frame(Frequency = breaks2, Target = f.Tar$Frequency, Reference = f.Ilan2$Frequency[1:6], Expected = out3$expected, ChiSq = (out3$observed - out3$expected)^2 / out3$expected)
  #print(out3)
  return(list(out1, out2, out3, out4))
}#end if

}#end function
```

## Bin Checking Function
The purpose of the bin-checking function is to permit the user to provide a set of bins and get the MAD and RMS values as output.  They can then be used to determine how good the bins are.  The function is not written for general-purpose use, but for the specific purpose of this project.
```{r}
bin.check <- function(Ilan, Target, breaks = c(1, 4, 20, 100, 150), remove = FALSE)
{

#Label the frequency groups
Ilan.name.freq = data.frame(table(Ilan$name), Category=NA)
colnames(Ilan.name.freq) <- c('Name', 'Freq', 'Category')
Ilan.name.freq$Category = ifelse(Ilan.name.freq$Freq<=breaks[1], 'F1', Ilan.name.freq$Category)
Ilan.name.freq$Category = ifelse(Ilan.name.freq$Freq>breaks[1] & Ilan.name.freq$Freq<=breaks[2], 'F2', Ilan.name.freq$Category)
Ilan.name.freq$Category = ifelse(Ilan.name.freq$Freq>breaks[2] & Ilan.name.freq$Freq<=breaks[3], 'F3', Ilan.name.freq$Category)
Ilan.name.freq$Category = ifelse(Ilan.name.freq$Freq>breaks[3] & Ilan.name.freq$Freq<=breaks[4], 'F4', Ilan.name.freq$Category)
Ilan.name.freq$Category = ifelse(Ilan.name.freq$Freq>breaks[4] & Ilan.name.freq$Freq<=breaks[5], 'F5', Ilan.name.freq$Category)
Ilan.name.freq$Category = ifelse(Ilan.name.freq$Freq>breaks[5], 'F6', Ilan.name.freq$Category)

#Obtain Reference frequencies
f.Ilan = Ilan.name.freq %>% 
  group_by(Category) %>%
  summarize(Frequency = sum(Freq))

#Pull the name lists
names.F1 = filter(Ilan.name.freq, Category=='F1')$Name
names.F2 = filter(Ilan.name.freq, Category=='F2')$Name
names.F3 = filter(Ilan.name.freq, Category=='F3')$Name
names.F4 = filter(Ilan.name.freq, Category=='F4')$Name
names.F5 = filter(Ilan.name.freq, Category=='F5')$Name
names.F6 = filter(Ilan.name.freq, Category=='F6')$Name

#Cross-check against target
Tar.name.freq = data.frame(table(Target$name), Category=NA)
colnames(Tar.name.freq) <- c('Name', 'Freq', 'Category')

Tar.name.freq$Category = ifelse(is.na(match(Tar.name.freq$Name, names.F1)), Tar.name.freq$Category, 'F1')
Tar.name.freq$Category = ifelse(is.na(match(Tar.name.freq$Name, names.F2)), Tar.name.freq$Category, 'F2')
Tar.name.freq$Category = ifelse(is.na(match(Tar.name.freq$Name, names.F3)), Tar.name.freq$Category, 'F3')
Tar.name.freq$Category = ifelse(is.na(match(Tar.name.freq$Name, names.F4)), Tar.name.freq$Category, 'F4')
Tar.name.freq$Category = ifelse(is.na(match(Tar.name.freq$Name, names.F5)), Tar.name.freq$Category, 'F5')
Tar.name.freq$Category = ifelse(is.na(match(Tar.name.freq$Name, names.F6)), Tar.name.freq$Category, 'F6')
Tar.name.freq$Category = ifelse(is.na(Tar.name.freq$Category), 'F1', Tar.name.freq$Category) #Any missing names go 'rare'=F1

#Obtain Target frequencies
f.Tar = Tar.name.freq %>% 
  group_by(Category) %>%
  summarize(Frequency = sum(Freq))
  
###Goodness-of-fit test (2 cases: no remove & remove)
# For printing
B = breaks
breaks2 = c(paste('<=',B[1]), 
            paste(B[1]+1,'-',B[2]), 
            paste(B[2]+1,'-',B[3]), 
            paste(B[3]+1,'-',B[4]), 
            paste(B[4]+1,'-',B[5]), 
            paste('>',B[5]))

# Case 1: No remove
if(remove == FALSE)
{
  out1 = chisq.test(f.Tar$Frequency, p = f.Ilan$Frequency/sum(f.Ilan$Frequency), correct = FALSE)
  #print(out1)
  
  out2 = data.frame(Frequency = breaks2, Reference = f.Ilan$Frequency, Target = f.Tar$Frequency, Expected = out1$expected, ChiSq = (out1$observed - out1$expected)^2 / out1$expected)
  #print(out2)
  
  out2b = c(MAD = sum(abs(f.Ilan$Frequency[1:6] - rep(sum(f.Ilan$Frequency[1:6])/6, times=6)))/6 ,
            RMS = sqrt(sum((f.Ilan$Frequency[1:6] - rep(sum(f.Ilan$Frequency[1:6])/6, times=6))^2)) )

}#end if

# Case 2: Remove
if(remove == TRUE)
{
  #Removal
  Tar.name.freq$Freq = -Tar.name.freq$Freq #Set to negative to use "sum" function
  Combined.name.freq = full_join(Ilan.name.freq, Tar.name.freq, by='Name')
  Freq.reduced = apply(Combined.name.freq[,c(2,4)], MARGIN = 1, sum, na.rm=TRUE)
  Combined.name.freq2 = data.frame(Combined.name.freq, Freq.reduced)
  
  # Re-run frequency categories
  f.Ilan2 = Combined.name.freq2 %>% 
    group_by(Category.x) %>%
    summarize(Frequency = sum(Freq.reduced))
 
  #Goodness-of-fit
  out1 = chisq.test(f.Tar$Frequency, p = f.Ilan2$Frequency[1:6]/sum(f.Ilan2$Frequency[1:6]), correct = FALSE)
  #print(out1)
  
  out2 = data.frame(Frequency = breaks2, Reference = f.Ilan2$Frequency[1:6], Target = f.Tar$Frequency, Expected = out1$expected, ChiSq = (out1$observed - out1$expected)^2 / out1$expected)
  #print(out2)
  
  out2b = c(MAD = sum(abs(f.Ilan2$Frequency[1:6] - rep(sum(f.Ilan2$Frequency[1:6])/6, times=6)))/6 ,
            RMS = sqrt(sum((f.Ilan2$Frequency[1:6] - rep(sum(f.Ilan2$Frequency[1:6])/6, times=6))^2)) )
}#end if
  
return( list(breaks, round(out2b, 1)) )


}#end function
```


# Obtaining the Optimal Bins
Before showing the tests, we need to identify the bins to use.  I adapted the chi-squared function to isolate to statistics on the bins: the mean absolute deviation (MAD), and the root mean square (RMS). The search is shown below.  In the original search, I looked at the actual bin frequencies and made the decision for which bin to adjust at each step based on those. A reader could easily replicate this by using the chi-squared function.

## Optimal Bins for Ilan-1
```{r}
#chi.square(Ilan, GA3)
bin.check(Ilan, GA3, breaks = c(1, 3, 20, 100, 150), remove=TRUE) 
bin.check(Ilan, GA3, breaks = c(1, 4, 20, 100, 150), remove=TRUE) 
bin.check(Ilan, GA3, breaks = c(1, 5, 20, 100, 150), remove=TRUE) 
bin.check(Ilan, GA3, breaks = c(1, 6, 20, 100, 150), remove=TRUE) 

bin.check(Ilan, GA3, breaks = c(1, 5, 25, 100, 150), remove=TRUE) 
bin.check(Ilan, GA3, breaks = c(1, 5, 21, 100, 150), remove=TRUE) 
bin.check(Ilan, GA3, breaks = c(1, 5, 21, 70, 150), remove=TRUE)
bin.check(Ilan, GA3, breaks = c(1, 5, 21, 70, 130), remove=TRUE)  
bin.check(Ilan, GA3, breaks = c(1, 5, 22, 70, 150), remove=TRUE) 
bin.check(Ilan, GA3, breaks = c(1, 5, 22, 100, 150), remove=TRUE) #***
bin.check(Ilan, GA3, breaks = c(1, 5, 24, 100, 150), remove=TRUE) 
bin.check(Ilan, GA3, breaks = c(1, 5, 25, 100, 150), remove=TRUE) 
```
The MAD has some duplicates, but with the RMS it selects an optimal bin.  This will obtain for all subsequent cases as well.

Next, I want to see the effect of selecting the bins with GA not removed from Ilan.
```{r}
#chi.square(Ilan, GA3)
bin.check(Ilan, GA3, breaks = c(1, 3, 20, 100, 150)) 
bin.check(Ilan, GA3, breaks = c(1, 4, 20, 100, 150)) 
bin.check(Ilan, GA3, breaks = c(1, 5, 20, 100, 150)) 
bin.check(Ilan, GA3, breaks = c(1, 6, 20, 100, 150)) 

bin.check(Ilan, GA3, breaks = c(1, 5, 25, 100, 150)) 
bin.check(Ilan, GA3, breaks = c(1, 5, 21, 100, 150)) 
bin.check(Ilan, GA3, breaks = c(1, 5, 21, 70, 150))  #***
bin.check(Ilan, GA3, breaks = c(1, 5, 21, 70, 130))  
bin.check(Ilan, GA3, breaks = c(1, 5, 22, 70, 150)) 
bin.check(Ilan, GA3, breaks = c(1, 5, 22, 100, 150))
bin.check(Ilan, GA3, breaks = c(1, 5, 24, 100, 150)) 
bin.check(Ilan, GA3, breaks = c(1, 5, 25, 100, 150)) 
```
For this case, the MAD and RMS both have the lowest statistic with breaks c(1, 5, 21, 70, 150), which matches the MAD result for Ilan-1 with GA removed.


##  Optimal Bins for Ilan-3
Turning to Ilan-3, We search for breaks as follows:
```{r}
table(table(Ilan.Vol3$name))

bin.check(Ilan.Vol3, GA3, breaks = c(1,2,4,10,20)) #63.8, 216.6
bin.check(Ilan.Vol3, GA3, breaks = c(1,2,4,10,17)) #66.0, 224.3
bin.check(Ilan.Vol3, GA3, breaks = c(1,2,4,10,21)) #63.8, 212.2 ***
bin.check(Ilan.Vol3, GA3, breaks = c(1,2,4,10,25)) #63.8, 212.5

bin.check(Ilan.Vol3, GA3, breaks = c(1,2,5,10,21)) #77.7, 233.2
bin.check(Ilan.Vol3, GA3, breaks = c(1,2,5,10,25)) #77.7, 233.5
bin.check(Ilan.Vol3, GA3, breaks = c(1,2,3,10,21)) #80, 237.6
```

For the Ilan3 exclusion case (see Introduction Revision 17 note, above), we have:
```{r}
table(table(Ilan.Vol3r$name))

bin.check(Ilan.Vol3r, GA3, breaks = c(1,2,4,8,19)) #38.7 116.3 
bin.check(Ilan.Vol3r, GA3, breaks = c(1,2,4,8,18)) #32.3 106.0 
bin.check(Ilan.Vol3r, GA3, breaks = c(1,2,4,8,17)) #30.7 102.1  ***
bin.check(Ilan.Vol3r, GA3, breaks = c(1,2,4,8,16)) #30.7 102.1  ***
bin.check(Ilan.Vol3r, GA3, breaks = c(1,2,4,8,15)) #35.3 110.2
bin.check(Ilan.Vol3r, GA3, breaks = c(1,2,3,8,17)) #34.7 108.7
```


# Ilan-1 Tests
In this section, I will show the output of the chi-squared tests. The function automatically produces both the goodness of fit test (first), as well as the chi-squared test of independence (second).

## GA vs. Ilan-1
### Names removed from Ilan-1
For this test, for the paper, we are removing the GA data from Ilan-1:
```{r}
chi.square(Ilan, GA3, breaks = c(1, 5, 22, 100, 150), remove=TRUE) #***
```

### Names not removed from Ilan-1
We also show here, however, the result when not removing the data:
```{r}
Pop.GA.Ilan1 = chi.square(Ilan, GA3, breaks = c(1, 5, 21, 70, 150), remove=FALSE) #***
Pop.GA.Ilan1
```

### 6 deacons removed
We also show in the paper the p-value with the 6 deacons from Acts 6:5 removed:
```{r}
deacons = c('Stephen', 'Philip', 'Procharus', 'Nicanor', 'Timon', 'Parmenas')
ind = match(deacons, GA3$name)
GA4 = GA3[-ind,]

nrow(GA4)
chi.square(Ilan, GA4, breaks = c(1, 5, 22, 100, 150), remove=TRUE) #***
chi.square(Ilan, GA4, breaks = c(1, 5, 22, 100, 150), remove=FALSE) #***
```

### Uncontested names only
We also show the p-value with just the 26 uncontested names:
```{r}
### For Attested/thinned
thinned = c('Joshua', 'Jacob', 'Yohanan', 'Peter', 'Marcus', 'Sheila', 'Agrippa', 'Hananiah', 'Hanan', 'Archelaus', 'Qaifa', 'Gamaliel', 'Herod', 'Herod', 'Yohanan', 'Judah', 'Philip', 'Theodorus', 'Andreas', 'Abba', 'Justus', 'Jacob', 'Judah', 'Mattathias', 'Philip', 'Toma')
ind = match(thinned, GA3$name)
GA3.thin = GA3[ind,]

chi.square(Ilan, GA3.thin, breaks = c(1, 5, 22, 100, 150), remove=TRUE) #***
```

### Contested names only
We also show the p-value with just the 53 contested names:
```{r}
### For Attested/thinned
thinned = c('Joshua', 'Jacob', 'Yohanan', 'Petrus', 'Marcus', 'Sheila', 'Agrippa', 'Hananiah', 'Hanan', 'Archelaus', 'Qaifa', 'Gamaliel', 'Herod', 'Herod', 'Yohanan', 'Judah', 'Philip', 'Theodorus', 'Andreas', 'Abba', 'Justus', 'Jacob', 'Judah', 'Mattathias', 'Philip', 'Toma')
ind = match(thinned, GA3$name)
GA3.53 = GA3[-ind,]

chi.square(Ilan, GA3.53, breaks = c(1, 5, 22, 100, 150), remove=TRUE) #***
```


## Josephus vs. Ilan-1
Here are the results with his data removed, for the paper:
```{r}
chi.square(Ilan, Josephus, breaks = c(1, 5, 22, 100, 150), remove=TRUE) #***
```

We also show the non-removed results, for reference
```{r}
Pop.Joe.Ilan1 = chi.square(Ilan, Josephus, breaks = c(1, 5, 21, 70, 150), remove=FALSE) #***
Pop.Joe.Ilan1
```


## Historical Novels vs. Ilan-1
```{r}
Pop.HistNov.Ilan1 = chi.square(Ilan, HistNov, breaks = c(1, 5, 21, 70, 150), remove='partial')
Pop.HistNov.Ilan1
```

Below is The Spear alone:
```{r}
#The Spear
chi.square(Ilan, HistNov[33:98,], breaks=c(1, 5, 21, 70, 150), remove='partial')
```

Below is Ben Hur alone.  Six bins works.
```{r}
chi.square(Ilan, HistNov[1:32,], breaks=c(1, 5, 21, 70, 150), remove='partial')
```

## Ilan-1F vs. Ilan-1
```{r}
Pop.Ilan1F.Ilan1 = chi.square(Ilan, Ilan.Vol1.Fict, breaks = c(1, 5, 21, 70, 150), remove=FALSE)
Pop.Ilan1F.Ilan1
```


## Uniform vs. Ilan-1
This test was requested by a reviewer: Construct a uniform distribution, which corresponds to "anonymous community transmission" and run it against Ilan-1, for the purpose of comparison.

Which uniform distribution?  There are uncountably many possibilities: actual uniform distribution using GA names for different sample sizes [49 (one name each - close to the 53 reduced sample size of GB), 98 (two each, close to GA), 294 (six each, close to Josephus), 2205 (45 each, close to Ilan-1)].  This could be repeated with the Josephus name list, random subsets of Ilan-1, etc.  There is no limit to the possibilities, and this is why the null hypothesis is to fit the reference distribution, not the other way around.

The distribution that Anonymous felt was most appropriate in order to exemplify the case that GB were referring to as "anonymous community transmission" was to take the Ilan 457 names and randomly sample an appropriate sample size from that.  We select for the appropriate sample size GB's 52 names (total) after removals.  In order to perform this chi-square test, the sample is not fixed, therefore we will use a random sampling method (bootstrap test).

Algorithm:

1. Let X be a sample of size 52 (without replacement) from Ilan1-458 (old designation, Ilan-457 would make more sense)
2. Obtain the p-value for that test, save it
3. Repeat (1) and (2) for nrep times
4. The final statistic is the mean of the p-values, which will converge to the true value by the law of large numbers.

Note: For the bootstrap confidence intervals, shown in the paper in Figures 2-4, I stored the p-values in the above algorithm and used the 0.025th and 0.975th quantiles of those distributions for the confidence intervals.  The intervals slightly move around for different runs of the data, without substantively altering the visual impression, thereby confirming the method.

```{r}
Ilan458 = names(table(Ilan$name))
nrep = NREP  #Defined globally
pvalues = NA #initialize
set.seed(2024) #Fix the seed, for reproducibility
out.Ilan1.Pop = matrix(NA, nrow=6, ncol=NREP) #initialize

for (i in 1:nrep)
{
  X = data.frame(name = sample(Ilan458, size=GA.size))      #GA.size is defined globally
  out = chi.square(Ilan, X, breaks = c(1, 5, 21, 70, 150), remove=FALSE)
  out.Ilan1.Pop[,i] = out[[2]]$Target/GA.size
  pvalues[i] = out[[1]]$p.value #for pvalues
}

pvalue = mean(pvalues) #2.24e-13

summary(pvalues)     #max of 6.78e-10, with nrep=10000
sum(pvalues>2.2e-16) #614 > 2.2e-16

pvalues = sort(pvalues)
#Pop.Unif.Ilan1 = c(mean = mean(pvalues), quantile(pvalues, probs = c(0.025, 0.975)))
```

# Ilan-3 Tests
## GA vs. Ilan-3
```{r}
Pop.GA.Ilan3 = chi.square(Ilan.Vol3, GA3, breaks=c(1,2,4,10,21), remove=FALSE)
Pop.GA.Ilan3
```

Same test, but for the exclusions removed case (see revision 17 note in the Introduction, n=522)
```{r}
chi.square(Ilan.Vol3r, GA3, breaks=c(1,2,4,8,17), remove=FALSE)
```


## Josephus vs. Ilan-3
```{r}
Pop.Joe.Ilan3 = chi.square(Ilan.Vol3, Josephus, breaks=c(1,2,4,10,21), remove=FALSE)
Pop.Joe.Ilan3
```

Same test, but for the exclusions removed case (see revision 17 note in the Introduction, n=522)
```{r}
chi.square(Ilan.Vol3r, Josephus, breaks=c(1,2,4,8,17), remove=FALSE)
```

## Historical Novels vs. Ilan-3
```{r}
Pop.HistNov.Ilan3 = chi.square(Ilan.Vol3, HistNov, breaks=c(1,2,4,10,21), remove='partial')
Pop.HistNov.Ilan3
```

Same test, but for the exclusions removed case (see revision 17 note in the Introduction, n=522)
```{r}
chi.square(Ilan.Vol3r, HistNov, breaks=c(1,2,4,8,17), remove=FALSE)
```

Below is The Spear alone, by request:
```{r}
#The Spear
chi.square(Ilan.Vol3, HistNov[33:98,], breaks=c(1,2,4,10,21), remove='partial')
```

Below is Ben Hur alone, by request.  The run with 6 bins has sample size too small, so I manually combine them into 3 bins:
```{r}
chi.square(Ilan.Vol3, HistNov[1:32,], breaks=c(1,2,4,10,21), remove='partial')

Ref = c(394, 160+190, 160+146+169)
Tar = c(21, 4+1, 2+1+2)
chisq.test(Tar, p=Ref/sum(Ref), correct=FALSE)
```

## Ilan-1F Fictitious vs. Ilan-3
```{r}
Pop.Ilan1F.Ilan3 = chi.square(Ilan.Vol3, Ilan.Vol1.Fict, breaks = c(1,2,4,10,21), remove=FALSE)
Pop.Ilan1F.Ilan3
```

Same test, but for the exclusions removed case (see revision 17 note in the Introduction, n=522)
```{r}
chi.square(Ilan.Vol3r, Ilan.Vol1.Fict, breaks=c(1,2,4,8,17), remove=FALSE)
```

## Uniform Distributions vs. Ilan-3
```{r}
Ilan575 = names(table(Ilan.Vol3$name))
nrep = NREP
pvalues = NA #initialize
set.seed(2024) #Fix the seed, for reproducibility
out.Ilan3.Pop = matrix(NA, nrow=6, ncol=NREP) #initialize

for (i in 1:nrep)
{
  X = data.frame(name = sample(Ilan575, size=GA.size))
  out = chi.square(Ilan.Vol3, X, breaks = c(1,2,4,10,21), remove=FALSE)
  out.Ilan3.Pop[,i] = out[[2]]$Target/GA.size
  pvalues[i] = out[[1]]$p.value
}

pvalue = mean(pvalues) #7.16e-33

summary(pvalues)     #max of 4.40e-29, with nrep=10000
sum(pvalues>2.2e-16) #0 > 2.2e-16
```


# Name Origin Tests
## GA vs. Ilan-1
```{r}
f.GA = apply(GA[,4:9], 2, sum, na.rm=TRUE)
f.Ilan = apply(Ilan[,9:14], 2, sum, na.rm=TRUE) #No Arab or Egyptian
f.Ilan.rem = f.Ilan - f.GA

# Removed
out = chisq.test(f.GA, p = f.Ilan.rem/sum(f.Ilan.rem), correct = FALSE)
out
data.frame(Reference=f.Ilan.rem,  Observed = f.GA, Expected = out$expected, ChiSq = (out$observed - out$expected)^2 / out$expected)

#Not Removed
out = chisq.test(f.GA, p = f.Ilan/sum(f.Ilan), correct = FALSE)
out
Ori.GA.Ilan1 = data.frame(Reference=f.Ilan,  Observed = f.GA, Expected = out$expected, ChiSq = (out$observed - out$expected)^2 / out$expected)
Ori.GA.Ilan1
```

Here is the p-value with combining Semitic Greek and Semitic Hebrew (deacons in):
```{r}
### With the 6 deacons
f.GA.combined = f.GA[-3] #Remove Semitic.Greek, which is 0, so no loss
f.Ilan.rem.combined = f.Ilan.rem[-3] #Remove Semitic.Greek, which is 48
f.Ilan.rem.combined[2] = 196+48      #Add the 48 back in

out = chisq.test(f.GA.combined, p = f.Ilan.rem.combined/sum(f.Ilan.rem.combined), correct = FALSE)
out
data.frame(Reference=f.Ilan.rem.combined,  Observed = f.GA.combined, Expected = out$expected, ChiSq = (out$observed - out$expected)^2 / out$expected)
```

We also included the p-value in the paper with the six deacons removed, which is calculated here:
```{r}
GA4 = GA[-c(43, 41, 37, 48, 39),]
GA4[13,'Greek'] <- 2  #Remove one Philip

f.GA = apply(GA4[,4:9], 2, sum, na.rm=TRUE)
sum(f.GA)
f.Ilan = apply(Ilan[,9:14], 2, sum, na.rm=TRUE) #No Arab or Egyptian
f.Ilan.rem = f.Ilan - f.GA

# Removed
out = chisq.test(f.GA, p = f.Ilan.rem/sum(f.Ilan.rem), correct = FALSE)
out
data.frame(Reference=f.Ilan.rem,  Observed = f.GA, Expected = out$expected, ChiSq = (out$observed - out$expected)^2 / out$expected)
```

```{r}
### With the 6 deacons
f.GA.combined = f.GA[-3] #Remove Semitic.Greek, which is 0, so no loss
f.Ilan.rem.combined = f.Ilan.rem[-3] #Remove Semitic.Greek, which is 48
f.Ilan.rem.combined[2] = 196+48      #Add the 48 back in

out = chisq.test(f.GA.combined, p = f.Ilan.rem.combined/sum(f.Ilan.rem.combined), correct = FALSE)
out
data.frame(Reference=f.Ilan.rem.combined,  Observed = f.GA.combined, Expected = out$expected, ChiSq = (out$observed - out$expected)^2 / out$expected)
```


## Josephus vs. Ilan-1
```{r}
f.Joe = apply(Josephus[,9:14], 2, sum, na.rm=TRUE)
f.Ilan = apply(Ilan[,9:14], 2, sum, na.rm=TRUE)
f.Ilan.rem = f.Ilan-f.Joe

#Removed
out = chisq.test(f.Joe, p = f.Ilan.rem/sum(f.Ilan.rem), correct = FALSE)
out
data.frame(Reference=f.Ilan.rem,  Observed = f.Joe, Expected = out$expected, ChiSq = (out$observed - out$expected)^2 / out$expected)

#Not Removed
out = chisq.test(f.Joe, p = f.Ilan/sum(f.Ilan), correct = FALSE)
out
Ori.Joe.Ilan1 = data.frame(Reference=f.Ilan,  Observed = f.Joe, Expected = out$expected, ChiSq = (out$observed - out$expected)^2 / out$expected)
Ori.Joe.Ilan1
```

## Ilan-1F Fictitious vs. Ilan-1
```{r}
f.Vol1.Fict = apply(Ilan.Vol1.Fict[,10:16], 2, sum, na.rm=TRUE)
f.Ilan = apply(Ilan[,9:15], 2, sum, na.rm=TRUE)

out = chisq.test(f.Vol1.Fict, p = f.Ilan/sum(f.Ilan), correct = FALSE)
out
Ori.Ilan1F.Ilan1 = data.frame(Reference=f.Ilan,  Observed = f.Vol1.Fict, Expected = out$expected, ChiSq = (out$observed - out$expected)^2 / out$expected)
Ori.Ilan1F.Ilan1
```

## Uniform vs. Ilan-1
Sample size n=53 (no need to do n=83). 
```{r}
#Set up Ilan frequencies
f.Ilan = apply(Ilan[,9:14], 2, sum, na.rm=TRUE)

#Set up loop
nrep = NREP      #number of replicates, for bootstrap test
pvalues = NA   #initialize
set.seed(2024) #Fix the seed, for reproducibility
out.Ilan1.Origin = matrix(NA, nrow=6, ncol=NREP) #initialize

#Ilan.reduced data.frame to sample from
Ilan.reduced = Ilan[,c(2,9:14)] %>%
  group_by(name, Biblical, Semitic.Hebrew, Semitic.Greek, Greek, Latin, Persian) %>%
  summarize(n())
Ilan.size = nrow(Ilan.reduced)

#Resample and perform test nrep times
for (i in 1:nrep)
{
  index = sample(1:Ilan.size, size=GA.size, replace=TRUE)
  X = Ilan.reduced[index,]
  f.Unif = apply(X[,2:7], 2, sum, na.rm=TRUE)
  out = chisq.test(f.Unif, p = f.Ilan/sum(f.Ilan), correct = FALSE)
  out.Ilan1.Origin[,i] = out$observed/GA.size
  pvalues[i] = out$p.value
}

pvalue = mean(pvalues) #3.13e-06
summary(pvalues)       #Max is 6.16e-03
```
It is worth noting that 13 of the names were classified by Ilan with more than one origin.  For example, Abba occurs 16 times as Semitic.Hebrew, but twice as Semitic.Greek.  For the uniform distribution, all names occur once, unless they have two different origins, they are included once for each origin.  There are 457 unique names, and 13 which have a second origin listed, for a total of 470 names sampled from.

## GA vs. Ilan-3
```{r}
f.GA = apply(GA[,c(4:9,11)], 2, sum, na.rm=TRUE)
f.Ilan = apply(Ilan.Vol3[,c(10:15,17)], 2, sum, na.rm=TRUE) #Egyptian, but No Arab or Iranian

out = chisq.test(f.GA, p = f.Ilan/sum(f.Ilan), correct = FALSE)
out
Ori.GA.Ilan3 = data.frame(Reference=f.Ilan,  Observed = f.GA, Expected = out$expected, ChiSq = (out$observed - out$expected)^2 / out$expected)
Ori.GA.Ilan3
```

## Josephus vs. Ilan-3
```{r}
f.Joe = apply(Josephus[,c(9:14,16)], 2, sum, na.rm=TRUE)
f.Ilan = apply(Ilan.Vol3[,c(10:15,17)], 2, sum, na.rm=TRUE) #Egyptian, but No Arab or Iranian

out = chisq.test(f.Joe, p = f.Ilan/sum(f.Ilan), correct = FALSE)
out
Ori.Joe.Ilan3 = data.frame(Reference = f.Ilan, Observed = f.Joe, Expected = out$expected, ChiSq = (out$observed - out$expected)^2 / out$expected)
Ori.Joe.Ilan3
```

## Ilan-1F Fictitious vs. Ilan-3
```{r}
f.Vol1.Fict = apply(Ilan.Vol1.Fict[,c(10:17)], 2, sum, na.rm=TRUE)
f.Ilan = apply(Ilan.Vol3[,c(10:17)], 2, sum, na.rm=TRUE) #Egyptian, but No Arab or Iranian

out = chisq.test(f.Vol1.Fict, p = f.Ilan/sum(f.Ilan), correct = FALSE)
out
Ori.Ilan1F.Ilan3 = data.frame(Reference=f.Ilan,  Observed = f.Vol1.Fict, Expected = out$expected, ChiSq = (out$observed - out$expected)^2 / out$expected)
Ori.Ilan1F.Ilan3
```

## Uniform vs. Ilan-3
Same algorithm as for Uniform vs. Ilan-1.  Interestingly, there are no names with two origins listed in Ilan-3.
```{r}
#Set up Ilan frequencies
f.Ilan = apply(Ilan.Vol3[,c(10:15,17)], 2, sum, na.rm=TRUE)

#Set up loop
nrep = NREP      #number of replicates, for bootstrap test
pvalues = NA   #initialize
set.seed(2024) #Fix the seed, for reproducibility
out.Ilan3.Origin = matrix(NA, nrow=7, ncol=NREP) #initialize

#Ilan.reduced data.frame to sample from
Ilan.reduced = Ilan.Vol3[,c(2,10:15,17)] %>%
  group_by(name, Biblical, Semitic.Hebrew, Semitic.Greek, Greek, Latin, Persian, Egyptian) %>%
  summarize(n())
Ilan.size = nrow(Ilan.reduced)

#Resample and perform test nrep times
for (i in 1:nrep)
{
  index = sample(1:Ilan.size, size=GA.size, replace=TRUE)
  X = Ilan.reduced[index,]
  f.Unif = apply(X[,2:8], 2, sum, na.rm=TRUE)
  out = chisq.test(f.Unif, p = f.Ilan/sum(f.Ilan), correct = FALSE)
  out.Ilan3.Origin[,i] = out$observed/GA.size
  pvalues[i] = out$p.value
}

pvalue = mean(pvalues) #mean p-value = 0.2814
summary(pvalues)
```

This is a re-run of the same, but with sample size 82
```{r}
#Set up Ilan frequencies
f.Ilan = apply(Ilan.Vol3[,c(10:15,17)], 2, sum, na.rm=TRUE)

#Set up loop
nrep = NREP      #number of replicates, for bootstrap test
pvalues = NA   #initialize
set.seed(2024) #Fix the seed, for reproducibility
out.Ilan3.Origin = matrix(NA, nrow=7, ncol=NREP) #initialize

#Ilan.reduced data.frame to sample from
Ilan.reduced = Ilan.Vol3[,c(2,10:15,17)] %>%
  group_by(name, Biblical, Semitic.Hebrew, Semitic.Greek, Greek, Latin, Persian, Egyptian) %>%
  summarize(n())
Ilan.size = nrow(Ilan.reduced)

#Resample and perform test nrep times
for (i in 1:nrep)
{
  index = sample(1:Ilan.size, size=82, replace=TRUE)
  X = Ilan.reduced[index,]
  f.Unif = apply(X[,2:8], 2, sum, na.rm=TRUE)
  out = chisq.test(f.Unif, p = f.Ilan/sum(f.Ilan), correct = FALSE)
  out.Ilan3.Origin[,i] = out$observed/82
  pvalues[i] = out$p.value
}

pvalue = mean(pvalues) #mean p-value = 0.2008
summary(pvalues)
```

# Some Additional p-values requested by Vandeweghe for footnotes

## Bauckham's 79 (footnote 22)
### Popularity
```{r}
chi.square(Ilan, B79, breaks = c(1, 5, 22, 100, 150), remove=TRUE) #***
```

### Origin
```{r}
f.GA = apply(B79[,4:9], 2, sum, na.rm=TRUE)
f.Ilan = apply(Ilan[,9:14], 2, sum, na.rm=TRUE) #No Arab or Egyptian
f.Ilan.rem = f.Ilan - f.GA

# Removed
out = chisq.test(f.GA, p = f.Ilan.rem/sum(f.Ilan.rem), correct = FALSE)
out
data.frame(Reference=f.Ilan.rem,  Observed = f.GA, Expected = out$expected, ChiSq = (out$observed - out$expected)^2 / out$expected)
```

## GB's 53 (footnote 22)
### Popularity
```{r}
chi.square(Ilan, GB53, breaks = c(1, 5, 22, 100, 150), remove=TRUE) #***
```

### Origin
```{r}
f.GA = apply(GB53[,3:8], 2, sum, na.rm=TRUE)
f.Ilan = apply(Ilan[,9:14], 2, sum, na.rm=TRUE) #No Arab or Egyptian
f.Ilan.rem = f.Ilan - f.GA

# Removed
out = chisq.test(f.GA, p = f.Ilan.rem/sum(f.Ilan.rem), correct = FALSE)
out
data.frame(Reference=f.Ilan.rem,  Observed = f.GA, Expected = out$expected, ChiSq = (out$observed - out$expected)^2 / out$expected)
```

## Gospels, individually (footnote 23)
### Matthew
#### Popularity
```{r}
chi.square(Ilan, Matthew, breaks = c(1, 5, 22, 100, 150), remove=TRUE) #***
```

#### Origin
```{r}
Ilan.mod = data.frame(Greek=Ilan$Greek, Latin=Ilan$Latin, Persian=Ilan$Persian, 
                      Semitic.Greek=Ilan$Semitic.Greek, Semitic.Hebrew=Ilan$Semitic.Hebrew,
                      Biblical=Ilan$Biblical)
f.GA = apply(Matthew[,3:8], 2, sum, na.rm=TRUE)
f.Ilan = apply(Ilan.mod, 2, sum, na.rm=TRUE) #No Arab or Egyptian
f.Ilan.rem = f.Ilan - f.GA

# Removed
out = chisq.test(f.GA, p = f.Ilan.rem/sum(f.Ilan.rem), correct = FALSE)
out
data.frame(Reference=f.Ilan.rem,  Observed = f.GA, Expected = out$expected, ChiSq = (out$observed - out$expected)^2 / out$expected)
```

### Mark
#### Popularity
```{r}
chi.square(Ilan, Mark, breaks = c(1, 5, 22, 100, 150), remove=TRUE) #***
```

#### Origin
```{r}
Ilan.mod = data.frame(Greek=Ilan$Greek, Latin=Ilan$Latin, Persian=Ilan$Persian, 
                      Semitic.Greek=Ilan$Semitic.Greek, Semitic.Hebrew=Ilan$Semitic.Hebrew,
                      Biblical=Ilan$Biblical)
f.GA = apply(Mark[,3:8], 2, sum, na.rm=TRUE)
f.Ilan = apply(Ilan.mod, 2, sum, na.rm=TRUE) #No Arab or Egyptian
f.Ilan.rem = f.Ilan - f.GA

# Removed
out = chisq.test(f.GA, p = f.Ilan.rem/sum(f.Ilan.rem), correct = FALSE)
out
data.frame(Reference=f.Ilan.rem,  Observed = f.GA, Expected = out$expected, ChiSq = (out$observed - out$expected)^2 / out$expected)
```

### Luke-Acts
#### Popularity
```{r}
chi.square(Ilan, LukeActs, breaks = c(1, 5, 22, 100, 150), remove=TRUE) #***
```

#### Origin
```{r}
Ilan.mod = data.frame(Greek=Ilan$Greek, Latin=Ilan$Latin, Persian=Ilan$Persian, 
                      Semitic.Greek=Ilan$Semitic.Greek, Semitic.Hebrew=Ilan$Semitic.Hebrew,
                      Biblical=Ilan$Biblical)
f.GA = apply(LukeActs[,3:8], 2, sum, na.rm=TRUE)
f.Ilan = apply(Ilan.mod, 2, sum, na.rm=TRUE) #No Arab or Egyptian
f.Ilan.rem = f.Ilan - f.GA

# Removed
out = chisq.test(f.GA, p = f.Ilan.rem/sum(f.Ilan.rem), correct = FALSE)
out
data.frame(Reference=f.Ilan.rem,  Observed = f.GA, Expected = out$expected, ChiSq = (out$observed - out$expected)^2 / out$expected)
```

### John
#### Popularity
```{r}
chi.square(Ilan, John, breaks = c(1, 5, 22, 100, 150), remove=TRUE) #***
```

#### Origin
```{r}
Ilan.mod = data.frame(Greek=Ilan$Greek, Latin=Ilan$Latin, Persian=Ilan$Persian, 
                      Semitic.Greek=Ilan$Semitic.Greek, Semitic.Hebrew=Ilan$Semitic.Hebrew,
                      Biblical=Ilan$Biblical)
f.GA = apply(John[,3:8], 2, sum, na.rm=TRUE)
f.Ilan = apply(Ilan.mod, 2, sum, na.rm=TRUE) #No Arab or Egyptian
f.Ilan.rem = f.Ilan - f.GA

# Removed
out = chisq.test(f.GA, p = f.Ilan.rem/sum(f.Ilan.rem), correct = FALSE)
out
data.frame(Reference=f.Ilan.rem,  Observed = f.GA, Expected = out$expected, ChiSq = (out$observed - out$expected)^2 / out$expected)
```


## Footnote 57, Cephas instead of Peter
### Hellenists Excluded
```{r}
Ilan.mod = data.frame(Greek=Ilan$Greek, Latin=Ilan$Latin, Persian=Ilan$Persian, 
                      Semitic.Greek=Ilan$Semitic.Greek, Semitic.Hebrew=Ilan$Semitic.Hebrew,
                      Biblical=Ilan$Biblical)
f.GA = apply(GA_Cephas_for_Peter_Deacons_Excluded[,3:8], 2, sum, na.rm=TRUE)
f.Ilan = apply(Ilan.mod, 2, sum, na.rm=TRUE) #No Arab or Egyptian
f.Ilan.rem = f.Ilan - f.GA

# Removed
out = chisq.test(f.GA, p = f.Ilan.rem/sum(f.Ilan.rem), correct = FALSE)
out
data.frame(Reference=f.Ilan.rem,  Observed = f.GA, Expected = out$expected, ChiSq = (out$observed - out$expected)^2 / out$expected)
```

### Hellenists Included
```{r}
Ilan.mod = data.frame(Greek=Ilan$Greek, Latin=Ilan$Latin, Persian=Ilan$Persian, 
                      Semitic.Greek=Ilan$Semitic.Greek, Semitic.Hebrew=Ilan$Semitic.Hebrew,
                      Biblical=Ilan$Biblical)
f.GA = apply(GA_Cephas_for_Peter_Deacons_Included[,4:9], 2, sum, na.rm=TRUE)
f.Ilan = apply(Ilan.mod, 2, sum, na.rm=TRUE) #No Arab or Egyptian
f.Ilan.rem = f.Ilan - f.GA

# Removed
out = chisq.test(f.GA, p = f.Ilan.rem/sum(f.Ilan.rem), correct = FALSE)
out
data.frame(Reference=f.Ilan.rem,  Observed = f.GA, Expected = out$expected, ChiSq = (out$observed - out$expected)^2 / out$expected)
```

# Graphs
## Graph Docs vs. Ilan Vol 1 Popularity
```{r}
names = c('1', '2-5', '6-21', '22-70', '71-150', '151+')
names <- factor(names, levels = names)

p.GA = Pop.GA.Ilan1[[2]]$Target/sum(Pop.GA.Ilan1[[2]]$Target)
p.Joe = Pop.Joe.Ilan1[[2]]$Target/sum(Pop.Joe.Ilan1[[2]]$Target)
p.Ilan1F = Pop.Ilan1F.Ilan1[[2]]$Target/sum(Pop.Ilan1F.Ilan1[[2]]$Target)
p.HistNov = Pop.HistNov.Ilan1[[2]]$Target/sum(Pop.HistNov.Ilan1[[2]]$Target)
p.Unif = apply(out.Ilan1.Pop, MARGIN=1, FUN=quantile, probs=0.50)

sde.GA = sqrt(p.GA*(1-p.GA)/sum(Pop.GA.Ilan1[[2]]$Target) )
sde.Joe = sqrt(p.Joe*(1-p.Joe)/sum(Pop.Joe.Ilan1[[2]]$Target) )
sde.Ilan1F = sqrt(p.Ilan1F*(1-p.Ilan1F)/sum(Pop.Ilan1F.Ilan1[[2]]$Target) )
sde.HistNov = sqrt(p.HistNov*(1-p.HistNov)/sum(Pop.HistNov.Ilan1[[2]]$Target) )
lower.unif = apply(out.Ilan1.Pop, MARGIN=1, FUN=quantile, probs=0.025)
upper.unif = apply(out.Ilan1.Pop, MARGIN=1, FUN=quantile, probs=0.975)

data.GA = data.frame(Text='Gospels+Acts', names, 
                     Ilan = Pop.GA.Ilan1[[2]]$Reference/sum(Pop.GA.Ilan1[[2]]$Reference), prob=p.GA,
                     lower=p.GA-1.96*sde.GA, upper=p.GA+1.96*sde.GA)
data.Joe = data.frame(Text='Josephus', names, 
                      Ilan = Pop.Joe.Ilan1[[2]]$Reference/sum(Pop.Joe.Ilan1[[2]]$Reference), prob=p.Joe,
                      lower=p.Joe-1.96*sde.Joe, upper=p.Joe+1.96*sde.Joe)
data.Ilan1F = data.frame(Text='Fictitious', names, 
                      Ilan = Pop.Ilan1F.Ilan1[[2]]$Reference/sum(Pop.Ilan1F.Ilan1[[2]]$Reference), prob=p.Ilan1F,
                      lower=p.Ilan1F-1.96*sde.Ilan1F, upper=p.Ilan1F+1.96*sde.Ilan1F)
data.HistNov = data.frame(Text='Historical Novels', names, 
                      Ilan = Pop.HistNov.Ilan1[[2]]$Reference/sum(Pop.HistNov.Ilan1[[2]]$Reference), prob=p.HistNov,
                      lower=p.HistNov-1.96*sde.HistNov, upper=p.HistNov+1.96*sde.HistNov)
data.Unif = data.frame(Text='Uniform', names, 
                      Ilan = Pop.GA.Ilan1[[2]]$Reference/sum(Pop.GA.Ilan1[[2]]$Reference), prob=p.Unif,
                      lower=lower.unif, upper=upper.unif)


data = rbind(data.GA, data.Joe, data.Ilan1F, data.HistNov, data.Unif)
data[,3:6] = 100*data[,3:6]

g = ggplot(data, aes(names, Ilan))
g1 = g + geom_col(position = "dodge") +
  geom_errorbar(aes(ymin=lower, ymax=upper, color=Text), linewidth=1.25, position = position_dodge2(padding = 1))  + xlab("Name Occurrence Category of Ilan-1") + ylab("% of Names in Category") +
  theme_bw() + scale_colour_grey() +
  ylim(0, 89) + 
  theme(legend.position = c(0.95, 0.95),
    legend.justification = c("right", "top"))
g1
```
Normally it is bad practice to have so much white space in your graph.  However, I set the ylim to match the Ilan-3 graph, because otherwise I fear some readers will misread the CI as being wider than those of the other graph


## Graph Docs vs. Ilan Vol 3 Popularity
```{r}
#names = paste('F', 1:6, sep="")
names = c('1', '2', '3-4', '5-10', '11-21', '22+')
names <- factor(names, levels = names)

p.GA = Pop.GA.Ilan3[[2]]$Target/sum(Pop.GA.Ilan3[[2]]$Target)
p.Joe = Pop.Joe.Ilan3[[2]]$Target/sum(Pop.Joe.Ilan3[[2]]$Target)
p.Ilan1F = Pop.Ilan1F.Ilan3[[2]]$Target/sum(Pop.Ilan1F.Ilan3[[2]]$Target)
p.HistNov = Pop.HistNov.Ilan3[[2]]$Target/sum(Pop.HistNov.Ilan3[[2]]$Target)
p.Unif = apply(out.Ilan3.Pop, MARGIN=1, FUN=quantile, probs=0.50)

sde.GA = sqrt(p.GA*(1-p.GA)/sum(Pop.GA.Ilan3[[2]]$Target) )
sde.Joe = sqrt(p.Joe*(1-p.Joe)/sum(Pop.Joe.Ilan3[[2]]$Target) )
sde.Ilan1F = sqrt(p.Ilan1F*(1-p.Ilan1F)/sum(Pop.Ilan1F.Ilan3[[2]]$Target) )
sde.HistNov = sqrt(p.HistNov*(1-p.HistNov)/sum(Pop.HistNov.Ilan3[[2]]$Target) )
lower.unif = apply(out.Ilan3.Pop, MARGIN=1, FUN=quantile, probs=0.025)
upper.unif = apply(out.Ilan3.Pop, MARGIN=1, FUN=quantile, probs=0.975)

data.GA = data.frame(Text='Gospels+Acts', names, 
                     Ilan = Pop.GA.Ilan3[[2]]$Reference/sum(Pop.GA.Ilan3[[2]]$Reference), prob=p.GA,
                     lower=p.GA-1.96*sde.GA, upper=p.GA+1.96*sde.GA)
data.Joe = data.frame(Text='Josephus', names, 
                      Ilan = Pop.Joe.Ilan3[[2]]$Reference/sum(Pop.Joe.Ilan3[[2]]$Reference), prob=p.Joe,
                      lower=p.Joe-1.96*sde.Joe, upper=p.Joe+1.96*sde.Joe)
data.Ilan1F = data.frame(Text='Fictitious', names, 
                      Ilan = Pop.Ilan1F.Ilan3[[2]]$Reference/sum(Pop.Ilan1F.Ilan3[[2]]$Reference), prob=p.Ilan1F,
                      lower=p.Ilan1F-1.96*sde.Ilan1F, upper=p.Ilan1F+1.96*sde.Ilan1F)
data.HistNov = data.frame(Text='Historical Novels', names, 
                      Ilan = Pop.HistNov.Ilan3[[2]]$Reference/sum(Pop.HistNov.Ilan3[[2]]$Reference), prob=p.HistNov,
                      lower=p.HistNov-1.96*sde.HistNov, upper=p.HistNov+1.96*sde.HistNov)
data.Unif = data.frame(Text='Uniform', names, 
                      Ilan = Pop.GA.Ilan3[[2]]$Reference/sum(Pop.GA.Ilan3[[2]]$Reference), prob=p.Unif,
                      lower=lower.unif, upper=upper.unif)

data = rbind(data.GA, data.Joe, data.Ilan1F, data.HistNov, data.Unif)
data[,3:6] = 100*data[,3:6]

g = ggplot(data, aes(names, Ilan))
g2 = g + geom_col(position = "dodge") +
  geom_errorbar(aes(ymin=lower, ymax=upper, color=Text), linewidth=1.25, position = position_dodge2(padding = 1)) + 
  xlab("Name Occurrence Category of Ilan-3") + ylab("% of Names in Category") +
  theme_bw() + scale_colour_grey() +
  theme(legend.position = "none")
g2

#png(file="Barplot_Popularity.png", width=750, height=550, res=NA)
grid.arrange(g1,g2, nrow=1, ncol=2)
# dev.off()

# pdf(file="Barplot_Popularity.pdf", width=7.5, height=5.5, compress=FALSE)
# grid.arrange(g1,g2, nrow=1, ncol=2)
# #dev.off()

```


## Graph Docs, vs. Ilan Vol. 1 - Origin Stats
```{r}
names = rownames(Ori.Ilan1F.Ilan1)
#G = names[4]; SG = names[3]; names[3:4] <- c(G,SG)
names <- factor(names, levels = names) #*** It requires the use of "<-" as opposed to "="

p.GA1 = Ori.GA.Ilan1$Observed/sum(Ori.GA.Ilan1$Observed)
p.Joe1 = Ori.Joe.Ilan1$Observed/sum(Ori.Joe.Ilan1$Observed)
p.Ilan1F = Ori.Ilan1F.Ilan1$Observed/sum(Ori.Ilan1F.Ilan1$Observed)
p.Unif = apply(out.Ilan1.Origin, MARGIN=1, FUN=quantile, probs=0.50)
p.GA = c(p.GA1, 0)   #Add an "Arabian" frequency of 0
p.Joe = c(p.Joe1, 0)
p.Unif = c(p.Unif, 0)

sde.GA = sqrt(p.GA*(1-p.GA)/sum(Ori.GA.Ilan1$Observed) )
sde.Joe = sqrt(p.Joe*(1-p.Joe)/sum(Ori.Joe.Ilan1$Observed) )
sde.Ilan1F = sqrt(p.Ilan1F*(1-p.Ilan1F)/sum(Ori.Ilan1F.Ilan1$Observed) )
lower.unif = apply(out.Ilan1.Origin, MARGIN=1, FUN=quantile, probs=0.025)
upper.unif = apply(out.Ilan1.Origin, MARGIN=1, FUN=quantile, probs=0.975)
lower.unif = c(lower.unif, 0)
upper.unif = c(upper.unif, 0)


data.GA = data.frame(Text='Gospels+Acts', names, 
                     Ilan = Ori.Ilan1F.Ilan1$Reference/sum(Ori.Ilan1F.Ilan1$Reference), prob=p.GA,
                     lower=p.GA-1.96*sde.GA, upper=p.GA+1.96*sde.GA)
data.Joe = data.frame(Text='Josephus', names, 
                      Ilan = Ori.Ilan1F.Ilan1$Reference/sum(Ori.Ilan1F.Ilan1$Reference), prob=p.Joe,
                      lower=p.Joe-1.96*sde.Joe, upper=p.Joe+1.96*sde.Joe)
data.Ilan1F = data.frame(Text='Fictitious', names, 
                      Ilan = Ori.Ilan1F.Ilan1$Reference/sum(Ori.Ilan1F.Ilan1$Reference), prob=p.Ilan1F,
                      lower=p.Ilan1F-1.96*sde.Ilan1F, upper=p.Ilan1F+1.96*sde.Ilan1F)
data.Unif = data.frame(Text='Uniform', names, 
                       Ilan = Ori.Ilan1F.Ilan1$Reference/sum(Ori.Ilan1F.Ilan1$Reference), prob=p.Unif,
                       lower=lower.unif, upper=upper.unif)

data = rbind(data.GA, data.Joe, data.Ilan1F, data.Unif)
data[,3:6] = 100*data[,3:6]

g = ggplot(data, aes(names, Ilan))

#png(file="Barplot_Origin.png", width=750, height=550, res=NA)
g + geom_col(position = "dodge") +
  geom_errorbar(aes(ymin=lower, ymax=upper, color=Text), linewidth=1.25, position = position_dodge2(padding = 1))  + 
  xlab("Name Origin Category of Ilan-1") + ylab("% of Names in Category") +
  theme_bw() + scale_colour_grey() + 
  theme(legend.position = c(0.95, 0.95),
    legend.justification = c("right", "top"))
#dev.off()

# #pdf(file="Barplot_Origin.pdf", width=7.50, height=5.50, compress=FALSE)
# g + geom_col(position = "dodge") +
#   geom_errorbar(aes(ymin=lower, ymax=upper, color=Text), linewidth=1.25, position = position_dodge2(padding = 1))  + 
#   xlab("Name Origin Category of Ilan-1") + ylab("% of Names in Category") +
#   theme_bw() + scale_colour_grey() + 
#   theme(legend.position = c(0.95, 0.95),
#     legend.justification = c("right", "top"))
# #dev.off()
```


## Graph Docs vs. Ilan Vol. 3 - Origin Stats
```{r}
names = rownames(Ori.Ilan1F.Ilan3)
#G = names[4]; SG = names[3]; names[3:4] <- c(G,SG)
names <- factor(names, levels = names) #*** It requires the use of "<-" as opposed to "="

p.GA1 = Ori.GA.Ilan3$Observed/sum(Ori.GA.Ilan3$Observed)
p.Joe1 = Ori.Joe.Ilan3$Observed/sum(Ori.Joe.Ilan3$Observed)
p.Ilan1F = Ori.Ilan1F.Ilan3$Observed/sum(Ori.Ilan1F.Ilan3$Observed)
p.Unif = apply(out.Ilan3.Origin, MARGIN=1, FUN=quantile, probs=0.50)
p.GA = c(p.GA1, 0)   #Both Egyptian and Arabian are 0, this is one of them
p.Joe = c(p.Joe1, 0) #Both Egyptian and Arabian are 0, this is one of them
p.Unif = c(p.Unif[1:6], 0, p.Unif[7]) #Insert 0 for Arabian

sde.GA = sqrt(p.GA*(1-p.GA)/sum(Ori.GA.Ilan3$Observed) )
sde.Joe = sqrt(p.Joe*(1-p.Joe)/sum(Ori.Joe.Ilan3$Observed) )
sde.Ilan1F = sqrt(p.Ilan1F*(1-p.Ilan1F)/sum(Ori.Ilan1F.Ilan3$Observed) )
lower.unif = apply(out.Ilan3.Origin, MARGIN=1, FUN=quantile, probs=0.025)
upper.unif = apply(out.Ilan3.Origin, MARGIN=1, FUN=quantile, probs=0.975)
lower.unif = c(lower.unif[1:6], 0, lower.unif[7]) #Insert 0 for Arabian
upper.unif = c(upper.unif[1:6], 0, upper.unif[7]) #Insert 0 for Arabian

data.GA = data.frame(Text='Gospels+Acts', names, 
                     Ilan = Ori.Ilan1F.Ilan3$Reference/sum(Ori.Ilan1F.Ilan3$Reference), prob=p.GA,
                     lower=p.GA-1.96*sde.GA, upper=p.GA+1.96*sde.GA)
data.Joe = data.frame(Text='Josephus', names, 
                      Ilan = Ori.Ilan1F.Ilan3$Reference/sum(Ori.Ilan1F.Ilan3$Reference), prob=p.Joe,
                      lower=p.Joe-1.96*sde.Joe, upper=p.Joe+1.96*sde.Joe)
data.Ilan1F = data.frame(Text='Fictitious', names, 
                      Ilan = Ori.Ilan1F.Ilan3$Reference/sum(Ori.Ilan1F.Ilan3$Reference), prob=p.Ilan1F,
                      lower=p.Ilan1F-1.96*sde.Ilan1F, upper=p.Ilan1F+1.96*sde.Ilan1F)
data.Unif = data.frame(Text='Uniform', names, 
                       Ilan = Ori.Ilan1F.Ilan3$Reference/sum(Ori.Ilan1F.Ilan3$Reference), prob=p.Unif,
                       lower=lower.unif, upper=upper.unif)

data = rbind(data.GA, data.Joe, data.Ilan1F, data.Unif)
data[,3:6] = 100*data[,3:6]

g = ggplot(data, aes(names, Ilan))

#png(file="Barplot_Origin2.png", width=750, height=550, res=NA)
g + geom_col(position = "dodge") +
  geom_errorbar(aes(ymin=lower, ymax=upper, color=Text), linewidth=1.25, position = position_dodge2(padding = 1)) + 
  xlab("Name Origin Category of Ilan-3") + ylab("% of Names in Category") +
  theme_bw() + scale_colour_grey() + 
  theme(legend.position = c(0.95, 0.95),
        legend.justification = c("right", "top"))
#dev.off()

# #pdf(file="Barplot_Origin2.pdf", width=7.50, height=5.50, compress=FALSE)
# g + geom_col(position = "dodge") +
#   geom_errorbar(aes(ymin=lower, ymax=upper, color=Text), linewidth=1.25, position = position_dodge2(padding = 1)) + 
#   xlab("Name Origin Category of Ilan-3") + ylab("% of Names in Category") +
#   theme_bw() + scale_colour_grey() + 
#   theme(legend.position = c(0.95, 0.95),
#         legend.justification = c("right", "top"))
# #dev.off()
```

## Graph Name Frequencies
### Setting up the data
This is a very dirty calculation, but I want to get done and work on the writing, and this does the job, so I apologize for the messy code.
```{r}
# Set up  the data.frame (very, very dirty alculations)
f.I =table(Ilan$name)
ind = order(f.I, decreasing=TRUE)
f.I2 = data.frame(f.I[ind[1:12]])
names(f.I2) <- c('Name', 'Ilan1')

f.J =table(Josephus$name)
ind = order(f.J, decreasing=TRUE)
f.J2 = data.frame(f.J)
names(f.J2) <- c('Name', 'Josephus')

data1 = left_join(f.I2, GA[,1:2], by='Name')
data2 = left_join(data1, f.J2, by='Name')

rare = c('Rare', 266, 10, 33)
data3 = rbind(rare, data2)
colnames(data3)[3] <- 'Gospels+Acts'
data3[13,3] = 0
#data3[,2:4] <- as.numeric(data3[,2:4])

data4 = data.frame(data3[,1], 
                   100*as.numeric(data3[,2])/2186, 
                   100*as.numeric(data3[,3])/83, 
                   100*as.numeric(data3[,4])/274)
names(data4) <- c('Name', 'Ilan1', 'Gospels+Acts', 'Josephus')

#Prep for plotting
data6 = data4 %>%
  pivot_longer(
    cols = !Name,
    names_to = 'Text',
    values_to = 'Proportion'
  )
data6$Name <- factor(data6$Name, levels = data4$Name[1:13])
data6$PlotKey = rep(1:13, each=3)

```

Here is the plotting code:
```{r}
library(ggformula)

#png(file="Barplot_Names.png", width=750, height=550, res=NA)
filter(data6, Text=='Ilan1') %>% 
  gf_col(Proportion ~ Name) %>%
  gf_line(Proportion ~ PlotKey, color = ~Text, linewidth=1.25,
          data = filter(data6, Text=='Gospels+Acts' | Text=='Josephus')) %>%
  gf_point(Proportion ~ PlotKey, color = ~Text, size=2.25,
          data = filter(data6, Text=='Gospels+Acts' | Text=='Josephus')) %>%
  gf_theme(theme_bw()) %>%
  gf_refine(scale_color_grey()) %>%
  gf_theme(legend.position=c(.90, .87)) %>%
  gf_labs(x = "Names", y = "% of Occurrences of Name")
#dev.off()

# #pdf(file="Barplot_Names.pdf", width=7.50, height=5.50, compress=FALSE)
# filter(data6, Text=='Ilan1') %>% 
#   gf_col(Proportion ~ Name) %>%
#   gf_line(Proportion ~ PlotKey, color = ~Text, linewidth=1.25,
#           data = filter(data6, Text=='Gospels+Acts' | Text=='Josephus')) %>%
#   gf_point(Proportion ~ PlotKey, color = ~Text, size=2.25,
#           data = filter(data6, Text=='Gospels+Acts' | Text=='Josephus')) %>%
#   gf_theme(theme_bw()) %>%
#   gf_refine(scale_color_grey()) %>%
#   gf_theme(legend.position=c(.90, .87)) %>%
#   gf_labs(x = "Names", y = "% of Occurrences of Name")
# #dev.off()
```

# Power Calculations
## Popularity Statistics
A reviewer recommended adding power calculations to the discussion of sample size.  This is a great idea and helps to address that issue while simultaneously expanding the statistical information and filling out the picture further.

Below is a graph which shows the power of the chi-squared goodness-of-fit test to detect differences from the null-distribution.  In other words, it shows the power to correctly detect when the target distribution does not fit the reference.  In the case of GA, it is the probability that when the Gospels and Acts do not fit Ilan, it is able to be detected.  The power is calculated for three different effect sizes: large (0.5), medium (0.3), and small (0.1).  The power is calculated for sample sizes ranging from 20 to 300.  The power is calculated for a significance level of 0.05.  Principally, however, the power for the uniform distribution, which is our alternative hypothesis of interest, is also shown.

```{r}
### First 3 effect sizes
min = 20
max = 300
n = min:max
effect.size = c(0.5, 0.3, 0.1) #large, medium, small
out1 = pwr.chisq.test(N=n, w=effect.size[1], df=5, sig.level=0.05)
out2 = pwr.chisq.test(N=n, w=effect.size[2], df=5, sig.level=0.05)
out3 = pwr.chisq.test(N=n, w=effect.size[3], df=5, sig.level=0.05)

### Effect size for uniform
#1
N0 = length(Ilan$name)
pi = table(Ilan$name)/N0
#sum(pi) #check

#2
N1 = length(pi)
pi.hat = rep(1/N1, times=N1)
#sum(pi.hat) #check

#3
lam = sum(((pi - pi.hat)^2)/pi.hat) #H0: Ilan vs. H1: Discrete Uniform
lam2 = sum(((pi - pi.hat)^2)/pi)    #H0: Discrete Uniform vs. H1: Ilan

#4
X2 = qchisq(0.95, df=5)  #df=5 because of 6 bins, 6-1=5
power = 1-pchisq(X2, df=5, ncp=n*lam)
power2 = 1-pchisq(X2, df=5, ncp=n*lam2)
#lam2 and power2 are using the H0 and H1 as GB, which is awkward to do, but I wanted to see if it would change the result.  lam2 is smaller than lam, but the power is still 1.00 for the n of interest.


### Plot
plot(n, out1$power, type='l', col='blue', lwd=2, lty=1, xlab='Sample Size', ylab='Power', ylim=c(0,1))
points(n, out2$power, type='l', col='orange', lwd=2, lty=2)
points(n, out3$power, type='l', col='green', lwd=2, lty=3)
points(n, power, type='l', col='red', lwd=2, lty=4)

grid()
legend('right', c('Large Effect', 'Medium Effect', 'Small Effect', 'Uniform'), col=c('blue', 'orange', 'green', 'red'), lty=1:4, lwd=2)
```

The standard reference text, Cohen J. (1988), supplied the small, medium, and large effect sizes, which are 0.1, 0.3, and 0.5, respectively.  An appropriate question to ask is: When the target doesn't fit the reference, and we calculate power under the alternative hypothesis, what is the appropriate alternative hypothesis?  The dominant competing hypothesis is that proposed by Gregor & Blais, namely the uniform distribution.  This is why we calculated the approximate the effect size equivalent for the uniform distribution.

How did we do it?  Agresti (2002) p. 244 says to determine the approximate power for a chi-squared test with df = v, do the following:

1. set the H0 values {pi_i} (Ilan)
2. calculate {pi_i(M)} by fitting to {pi_i} the model M (discrete uniform)
3. calculate the noncentrality parameter ncp with
n*sum(((pi_i - pi_i(M))^2)/pi_i(M))
4. calculate power as P(X^2_v,ncp > X^2_v(alpha))

These four steps were implemented in the above code, and are numbered for clarity.  The result is that there is a very large difference between the Ilan vs. the uniform.  As a result, the test has very high power across the different sample sizes.  What is the interpretation of high power?  It means that when the uniform distribution is true (i.e. the authors made up the names randomly), and the test concludes that at the 5% significance level, then it is almost certain that the correct decision was made.

## Origin
The same basic result obtains for the origin statistics, when using 6 categories (lambda, which is ncp/n, is only 2.1, whereas for the popularity statistics it was 11.9).  The power is lowest at 0.994 for n=20, but moves to 0.9996 by n=30, and climbing.
```{r}
#1
N0 = apply(select(Ilan, Biblical:Persian), MARGIN=2, FUN=sum, na.rm=TRUE)
pi = N0/sum(N0)
#sum(pi) #check

#2
N1 = length(pi)
pi.hat = rep(1/N1, times=N1)
#sum(pi.hat) #check

#3
lam = sum(((pi - pi.hat)^2)/pi.hat) #H0: Ilan vs. H1: Discrete Uniform
lam2 = sum(((pi - pi.hat)^2)/pi)    #H0: Discrete Uniform vs. H1: Ilan

#4
X2 = qchisq(0.95, df=5)
power = 1-pchisq(X2, df=5, ncp=n*lam)
power2 = 1-pchisq(X2, df=5, ncp=n*lam2)
#lam2 and power2 are using the H0 and H1 as GB, which is awkward to do, but I wanted to see if it would change the result.  lam is smaller than lam2.  The power is still 1.00 for the n of interest for both.
```

For the paper, Josephus, the combined historical novels, and the fictional name set all have sample sizes substantially larger than n=52, and so it is clear from the same code that their power calculations will also yield 1.00, because the ncp will be substantially larger.

Statistically, it may seem strange that the power is so large.  The explanation is that the null and alternative are quite far from one another.  The discrete Ilan-1 is like an exponential distribution - very different than a uniform.  Take that and combined it with sample sizes 50+ and it pushes the non-centrality parameter way up resulting in large chi-squared values.

## Example 6-sided die
The first power for the 6-sided die example is 
```{r}
#1
pi = rep(1/6, times=6)
#sum(pi) #check

#2
N1 = c(8, 8, 8, 12, 12, 12)
n = sum(N1)
pi.hat = N1/n
#sum(pi.hat) #check

#3
lam = sum(((pi - pi.hat)^2)/pi.hat)

#4
X2 = qchisq(0.95, df=6-1)
1-pchisq(X2, df=6-1, ncp=n*lam)
#lam2 and power2 are using the H0 and H1 as GB, which is awkward to do, but I wanted to see if it would change the result.  lam is smaller than lam2.  The power is still 1.00 for the n of interest for both.
```

The second power for the 6-sided die example is 
```{r}
#1
pi = rep(1/6, times=6)
#sum(pi) #check

#2
N1 = c(5, 5, 5, 15, 15, 15)
n = sum(N1)
pi.hat = N1/n
#sum(pi.hat) #check

#3
lam = sum(((pi - pi.hat)^2)/pi.hat)

#4
X2 = qchisq(0.95, df=6-1)
1-pchisq(X2, df=6-1, ncp=n*lam)
#lam2 and power2 are using the H0 and H1 as GB, which is awkward to do, but I wanted to see if it would change the result.  lam is smaller than lam2.  The power is still 1.00 for the n of interest for both.
```

# Definitions of Data Frame Labels in .RData file
Each of the items below is the name of a data.frame in the file *Gospels-Acts-Ilan3.RData*.

- B79: Bauckham's 79 names in the Gospels and Acts
- GA: Gospels and Acts with frequency listed for each name
- GA3: the standard format file of Gospels & Acts names which we primarily use in the work
- GA_Cephas_for_Peter_Deacons_excluded: GA3 with Cephas in place of Peter and deacons excluded
- GA_Cephas_for_Peter_Deacons_included: GA3 with Cephas in place of Peter and deacons included
- GB: the original version of Ilan from Gregor & Blaise
- GB53: related to GA3, but with the 53 unconteseted names listed by Gregor & Blaise
- Ilan: our version of Ilan, created from GB, with slight name adjustments as described in the paper
- Ilan.Complete: The full Ilan data scrape, with all volumes combined.  It is not a perfectly clean dataset, but this could be useful for future research and we offer it to the scholarly community.
- Ilan.Vol1: Derived from Ilan.Complete, this is our uncritical cut that is comparable to GB, as discussed in the paper.  Note that we favored GB in the paper, as described.
- Ilan.Vol1.Fict: Derived from Ilan.Complete, this is our scrape that includes the fictitious names, as discussed in the paper.
- Ilan.Vol3: Derived from Ilan.Complete, this is our scrape that includes the names from Volume 3, which are Diaspora Jews.
- John: only the names from the Gospel of John
- Josephus: the names from Josephus, as adapted from Gregor & Blaise
- LukeActs: only the names from Luke-Acts
- Mark: only the names from the Gospel of Mark
- Matthew: only the names from the Gospel of Matthew